16 research outputs found

    Detecting clusters and their dynamics in the Forex Market

    Get PDF
    This project studies and implements the clustering methods introduced by Fenn et al. to detect correlations in the foreign exchange market. To deal with the potentially non-linear nature of currency time series dependence, we propose two alternative similarity metrics to use instead of the Pearson linear correlation. We observe how each of them responds over several years of currency exchange data and find significant differences in the resulting clusters.

    On methods to assess the significance of community structure in networks of financial time series

    Get PDF
    We consider the problem of determining whether the community structure found by a clustering algorithm applied to financial time series is statistically significant, when no other information than the observed values and a similarity measure among time series is available. We propose two raw-data-based methods for assessing the robustness of clustering algorithms on time-dependent data linked by a relation of similarity: one based on community scoring functions that quantify some topological property that characterizes ground-truth communities, the other based on random perturbations and quantification of the variation in the community structure. These methodologies are well established in the realm of unweighted networks; our contribution is versions of them adapted to complete weighted networks. We reinforce our assessment of the accuracy of the clustering algorithm by testing its performance on synthetic ground-truth communities of time series built through Monte Carlo simulations of VARMA processes.
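    The perturbation-based stability check can be sketched as follows. This is an illustrative Python toy, not the paper's implementation: the threshold-based clustering stand-in, the Gaussian noise level, and the use of the Rand index to compare partitions are all assumptions made for the example.

```python
import random
from itertools import combinations

def threshold_clusters(sim, t):
    # toy stand-in for a community detector: union-find over pairs
    # whose similarity exceeds the threshold t
    n = len(sim)
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for i, j in combinations(range(n), 2):
        if sim[i][j] > t:
            parent[find(i)] = find(j)
    return [find(i) for i in range(n)]

def rand_index(a, b):
    # fraction of node pairs on which the two partitions agree
    pairs = list(combinations(range(len(a)), 2))
    return sum((a[i] == a[j]) == (b[i] == b[j]) for i, j in pairs) / len(pairs)

def stability(sim, t, noise=0.05, reps=50, seed=0):
    # recluster under random symmetric perturbations of the similarity
    # matrix and measure how much the community structure varies
    rng = random.Random(seed)
    base = threshold_clusters(sim, t)
    n, scores = len(sim), []
    for _ in range(reps):
        pert = [row[:] for row in sim]
        for i in range(n):
            for j in range(i + 1, n):
                pert[i][j] = pert[j][i] = sim[i][j] + rng.gauss(0, noise)
        scores.append(rand_index(base, threshold_clusters(pert, t)))
    return sum(scores) / reps
```

    A stability score near 1 indicates clusters that are robust to small variations in the similarities, which is the qualitative criterion the abstract describes.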

    Clustering of exchange rates and their dynamics under different dependence measures

    Get PDF
    This paper proposes an improvement to the method for clustering exchange rates given by D. J. Fenn et al. in Quantitative Finance, 12 (10), 2012, pp. 1493-1520. To deal with the potentially non-linear nature of currency time series dependence, we propose two alternative similarity metrics to use instead of the one used in the aforementioned paper, which is based on Pearson correlation. Our proposed similarity metrics are based upon Kendall and distance correlations. We observe how each of the newly adapted clustering methods responds over several years of currency exchange data and find significant differences in the resulting clusters.
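    As an illustration of swapping in a rank-based dependence measure, here is a minimal pure-Python sketch that builds a similarity matrix from Kendall's tau. The mapping of tau into [0, 1] via (1 + tau)/2 is an assumption made for the example, not necessarily the transformation used in the paper, and the naive O(n²) tau (no tie correction) is for clarity only.

```python
from itertools import combinations

def sign(v):
    return (v > 0) - (v < 0)

def kendall_tau(x, y):
    # concordant-minus-discordant pairs over all index pairs (no tie correction)
    pairs = list(combinations(range(len(x)), 2))
    s = sum(sign(x[i] - x[j]) * sign(y[i] - y[j]) for i, j in pairs)
    return s / len(pairs)

def similarity_matrix(series):
    # symmetric matrix of pairwise similarities, mapped into [0, 1]
    n = len(series)
    S = [[1.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        S[i][j] = S[j][i] = (1 + kendall_tau(series[i], series[j])) / 2
    return S
```

    In practice one would apply this to the log-returns of the exchange-rate series before feeding the matrix to a clustering algorithm.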

    On methods to assess the significance of community structure in networks of financial time series

    Get PDF
    We consider the problem of determining whether the community structure found by a clustering algorithm applied to financial time series is statistically significant, or is due to pure chance, when no other information than the observed values and a similarity measure among time series is available. As a subsidiary problem we also analyse the influence of the choice of similarity measure on the accuracy of the clustering method. We propose two raw-data-based methods for assessing the robustness of clustering algorithms on time-dependent data linked by a relation of similarity: one based on community scoring functions that quantify some topological property that characterises ground-truth communities, and another based on random perturbations and quantification of the variation in the community structure. These methodologies are well established in the realm of unweighted networks; our contribution is versions of these methodologies properly adapted to complete weighted networks.

    Towards an efficient algorithm for computing the reduced mutual information

    Get PDF
    Newman et al. introduced the Reduced Mutual Information (RMI), a measure of the similarity between two partitions of a set, useful in clustering and community detection. The computation of the RMI requires counting the number of contingency tables with fixed row and column sums, a #P-complete problem, for which the authors suggest analytical approximations. These work well in general, but in other, not especially pathological, cases they are highly inaccurate. We propose a hybrid scheme that combines existing Markov chain Monte Carlo methods with analytical approximations to make more accurate estimates of the number of contingency tables in all cases.
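    To make the counting problem concrete: the RMI correction requires the number of non-negative integer matrices whose row and column sums match those of the observed contingency table. The brute-force enumeration below is only feasible for tiny margins, which is exactly why approximations or MCMC are needed; it is an illustrative sketch, not the hybrid algorithm proposed in the paper.

```python
def compositions(s, caps):
    # all ways to write s as an ordered sum over len(caps) cells,
    # with cell k capped at caps[k]
    if len(caps) == 1:
        if s <= caps[0]:
            yield (s,)
        return
    for v in range(min(s, caps[0]) + 1):
        for rest in compositions(s - v, caps[1:]):
            yield (v,) + rest

def count_tables(rows, cols):
    # number of non-negative integer matrices with the given
    # row and column sums (exponential-time brute force)
    if not rows:
        return 1 if all(c == 0 for c in cols) else 0
    return sum(count_tables(rows[1:], [c - v for c, v in zip(cols, row)])
               for row in compositions(rows[0], cols))
```

    Roughly speaking, the RMI subtracts the logarithm of this count (divided by the number of objects) from the ordinary mutual information, so an accurate estimate of the count is the crux of the computation.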

    Identifying bias in cluster quality metrics

    Get PDF
    We study potential biases of popular cluster quality metrics, such as conductance or modularity. We propose a method that uses both stochastic block model and preferential attachment block model constructions to generate networks with preset community structures, to which the quality metrics are then applied. These models also allow us to generate multi-level structures of varying strength, which show whether metrics favour partitions into a larger or smaller number of clusters. Additionally, we propose another quality metric, the density ratio. We observed that most of the studied metrics tend to favour partitions into a smaller number of big clusters, even when their relative internal and external connectivity are the same. The metrics found to be least biased are modularity and the density ratio.

    Identifying bias in network clustering quality metrics

    Get PDF
    We study potential biases of popular network clustering quality metrics, such as those based on the dichotomy between internal and external connectivity. We propose a method that uses both stochastic block model and preferential attachment block model constructions to generate networks with preset community structures and Poisson or scale-free degree distributions, to which the quality metrics are then applied. These models also allow us to generate multi-level structures of varying strength, which show whether metrics favour partitions into a larger or smaller number of clusters. Additionally, we propose another quality metric, the density ratio. We observed that most of the studied metrics tend to favour partitions into a smaller number of big clusters, even when their relative internal and external connectivity are the same. The metrics found to be least biased are modularity and the density ratio.
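    A minimal sketch of the benchmark idea: generate a network with a preset two-block community structure from a stochastic block model and split its edge count into internal and external connectivity. This is pure-Python and illustrative only; the paper's generators (the preferential attachment variant, multi-level structures, degree-distribution control) are considerably richer.

```python
import random

def sbm(sizes, p_in, p_out, seed=0):
    # stochastic block model: nodes in the same block connect with
    # probability p_in, nodes in different blocks with probability p_out
    rng = random.Random(seed)
    labels = [b for b, s in enumerate(sizes) for _ in range(s)]
    n = len(labels)
    edges = [(i, j) for i in range(n) for j in range(i + 1, n)
             if rng.random() < (p_in if labels[i] == labels[j] else p_out)]
    return labels, edges

def internal_external(labels, edges):
    # split the edge count by whether the endpoints share a community label
    internal = sum(labels[i] == labels[j] for i, j in edges)
    return internal, len(edges) - internal
```

    A density-ratio-style metric would then compare internal to external edge density across candidate partitions; the exact definition used in the paper is not reproduced here.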

    The assessment of clustering on weighted networks with the R package clustAnalytics

    Get PDF
    We present clustAnalytics, an R package now available on CRAN, which provides methods to validate the results of clustering algorithms on unweighted and weighted networks, particularly for cases where the existence of a community structure is unknown. clustAnalytics comprises a set of criteria for assessing the significance and stability of a clustering. To evaluate cluster significance, clustAnalytics provides a set of community scoring functions and systematically compares their values to those of a suitable null model; for this it employs a switching model to produce randomized graphs with weighted edges. To test for cluster stability, a non-parametric bootstrap method is used, together with similarity metrics derived from information theory and combinatorics. To assess the effectiveness of our clustering quality evaluation methods, we provide methods to synthetically generate networks (weighted or not) with a ground-truth community structure, based on the stochastic block model construction as well as on a preferential attachment model, the latter producing networks with communities and a scale-free degree distribution.

    Clustering assessment in weighted networks

    Get PDF
    We provide a systematic approach to validate the results of clustering methods on weighted networks, in particular for cases where the existence of a community structure is unknown. Our validation of clustering comprises a set of criteria for assessing significance and stability. To test for cluster significance, we introduce a set of community scoring functions adapted to weighted networks and systematically compare their values to those of a suitable null model. For this we propose a switching model that produces randomized graphs with weighted edges while keeping the degree distribution constant. To test for cluster stability, we introduce a non-parametric bootstrap method combined with similarity metrics derived from information theory and combinatorics. To assess the effectiveness of our clustering quality evaluation methods, we test them on synthetically generated weighted networks with a ground-truth community structure of varying strength, based on the stochastic block model construction. When applying the proposed methods to the clusters of these synthetic ground-truth networks, as well as to other weighted networks with known community structure, they correctly identify the best-performing algorithms, which suggests their adequacy for cases where the clustering structure is not known. We test our clustering validation methods on a varied collection of well-known clustering algorithms applied to the synthetically generated networks and to several real-world weighted networks. All our clustering validation methods are implemented in R and will be released in the upcoming package clustAnalytics.
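    The switching model can be sketched as repeated degree-preserving edge swaps: pick two edges (a,b) and (c,d) and rewire them to (a,d) and (c,b) when this creates no self-loop or duplicate edge. In this toy Python version each weight simply travels with its rewired edge, which is one plausible reading of the construction rather than the paper's exact rule; note that node degrees are preserved but weighted strengths generally are not.

```python
import random

def switching_model(edges, n_steps, seed=0):
    # edges: list of (u, v, w) tuples for an undirected weighted graph
    rng = random.Random(seed)
    E = [list(e) for e in edges]
    present = {frozenset((u, v)) for u, v, w in edges}
    for _ in range(n_steps):
        i, j = rng.sample(range(len(E)), 2)
        (a, b, w1), (c, d, w2) = E[i], E[j]
        if len({a, b, c, d}) < 4:
            continue  # swap would create a self-loop or touch a shared node
        new1, new2 = frozenset((a, d)), frozenset((c, b))
        if new1 in present or new2 in present:
            continue  # swap would create a multi-edge
        present -= {frozenset((a, b)), frozenset((c, d))}
        present |= {new1, new2}
        E[i], E[j] = [a, d, w1], [c, b, w2]
    return [tuple(e) for e in E]
```

    After enough successful swaps the result is a randomized graph with the same degree sequence and the same multiset of edge weights, usable as a null model against which community scores can be compared.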

    Cluster evaluation on weighted networks

    Get PDF
    This thesis presents a systematic approach to validate the results of clustering methods on weighted networks, particularly for cases where the existence of a community structure is unknown. Including edge weights has many applications in network science, as there are many situations in which the strength of the connections between nodes is an essential property describing the network. This evaluation of clustering methods comprises a set of criteria for assessing their significance and stability. First, a well-established set of community scoring functions, which already existed for unweighted graphs, has been extended to the case where the edges have associated weights. We consider how, in some cases, several possible weighted extensions of the same function can be defined, each suited to a different type of weighted network. Additionally, methods to randomize graphs while maintaining the original graph's degree distribution have been defined, in order to use these random graphs as baseline networks. This randomization, together with the weighted community scoring functions, is then used to evaluate cluster significance: the random networks built from the original network with our methods provide reference values for each scoring function, which allow us to determine whether a given cluster score for the original graph is better than that of a comparable graph with the same degree distribution but no community structure. As for the evaluation of stability, we define non-parametric bootstrap methods with perturbations for weighted graphs, in which vertices are resampled multiple times and the perturbations are applied to the edge weights. This, together with some fundamental similarity metrics for set partitions derived from information theory and combinatorics, constitutes our criteria for clustering stability. 
These criteria are based on the essential idea that meaningful clusters should capture an inherent structure in the data and not be overly sensitive to small or local variations, or to the particularities of the clustering algorithm. A more in-depth study of the characteristics of cluster scoring functions and their potential bias towards clusters of a certain size has also been performed; such a bias would render some of these functions unsuitable for comparing the results of clustering algorithms when the sizes of the partitions differ considerably. For this analysis, we introduce parametrized multi-level ground-truth models, based on the stochastic block model and on preferential attachment, that showcase how the functions respond to varying the strength of each level of clusters in a hierarchical structure. Additionally, a scoring function that does not suffer from this kind of bias is proposed: the density ratio. This thesis also contributes an efficient implementation of Newman's Reduced Mutual Information, a measure for comparing set partitions based on information theory. Here it is used as a tool to compare network partitions, which is particularly useful for the evaluation of cluster stability, but it can have applications beyond the field of network clustering. Our algorithm uses a hybrid approach that combines analytical approximation with a Markov chain Monte Carlo method for a good balance between accuracy and efficiency. An indispensable part of this thesis is also the associated software we developed, which includes implementations of all the methods discussed in it, collected in our R package clustAnalytics. The package is designed to work together with igraph, the main R package dedicated to graphs, to make it easy and straightforward for other researchers to use. 
There are many useful applications for these tools: from the study and observation of new datasets to the evaluation and benchmarking of clustering algorithms. Moreover, some parts of the package, such as the implementation of the Reduced Mutual Information, can be used to compare partitions of any kind of set, not only network partitions.